Skip to content
This repository has been archived by the owner on Jan 23, 2023. It is now read-only.

Updating the JIT to handle the FMA hardware intrinsics #18105

Merged
merged 3 commits into from
May 25, 2018
Merged

Updating the JIT to handle the FMA hardware intrinsics #18105

merged 3 commits into from
May 25, 2018

Conversation

tannergooding
Copy link
Member

}
else
{
// TODO-XArch-CQ: Technically any one of the three operands can
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@CarolEidt, will there be any problems if we set op1, op2, and op3 as regOptional? Will the JIT pick just one, or will it actually try to spill all three?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You have to pick just one. #6361 is the issue that tracks allowing multiple operands to be specified as regOptional.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! I'm going to update the comment with a link to the bug.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

{
for (var i = 1; i < RetElementCount; i++)
{
if (BitConverter.DoubleToInt64Bits(Math.Round((firstOp[i] * secondOp[i]) + thirdOp[i], 13)) != BitConverter.DoubleToInt64Bits(Math.Round(result[i], 13)))
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All of the FMA tests do the same thing, but I am going to explicitly call it out here.

We don't have a good way to test FMA today due to the way the operation is executed. Technically, the operation does:

temp = firstOp * secondOp
temp = temp + thirdOp
result = round(temp)

This is in contrast to the two individual operations which do:

temp = firstOp * secondOp
result = round(temp)

temp = result + thirdOp
result = round(temp)

This is probably the "easiest" solution right now.

@tannergooding
Copy link
Member Author

Also FYI for any reviewers. The tests are auto-generated from a template.

The first commit (~650 Lines) contains the product changes and needs actual review.

The second commit (the remaining 12k lines) contains only test changes and probably does not need in depth review (outside of checking the template input data and looking at one or two tests as an example).

@dangi12012
Copy link

I still dont understand why the rounding should be a problem. GCC does use FMA for a*b+c by default.
FMA would make Vector Dot Product much faster and increases every JIT compiled language by a lot!

@tannergooding
Copy link
Member Author

tannergooding commented May 23, 2018

I still dont understand why the rounding should be a problem. GCC does use FMA for a*b+c by default.
FMA would make Vector Dot Product much faster and increases every JIT compiled language by a lot!

@dangi12012, Only if you:

  1. Explicitly enable the FMA instruction set (via -mfma)
    and
  2. Use -O2 or higher

Additionally, as per the IEEE 754:2008, automatically transforming (a * b) + c into a fusedMultiplyAdd operation is considered a value-changing optimization. Automatically doing these types of optimizations can change the output of a given program and introduce subtle and hard to diagnose bugs.

If such automatic optimizations were provided, they would need to be behind a switch that users can enable or disable as required.

Edit: It may also be worth mentioning that GCC is the only major compiler doing this (you have to opt-in using -ffast-math or /fp:fast with other compilers): https://godbolt.org/g/6yWntG

@tannergooding
Copy link
Member Author

Rebased onto dotnet/master to pickup #18078

@Jorenkv
Copy link

Jorenkv commented May 24, 2018

I think you could test the FMA computation by applying the TwoProduct and GrowExpansion routines from this paper to compute an exact result, then round it to a single/double value: https://people.eecs.berkeley.edu/~jrs/papers/robust-predicates.pdf

@4creators
Copy link

Also FYI for any reviewers. The tests are auto-generated from a template.

The first commit (~650 Lines) contains the product changes and needs actual review.

@tannergooding Why don't you split PR into 2 commits: (i) JIT changes, (ii) Tests - to be consistent with our previous work?

@tannergooding
Copy link
Member Author

Why don't you split PR into 2 commits: (i) JIT changes, (ii) Tests - to be consistent with our previous work?

@4creators, that isn't consistent with my previous HWIntrinsic work and it isn't consistent with standard practice.

Generally speaking, you shouldn't add new product code without corresponding tests exercising said code. As, otherwise, you don't have any validation that your code is correct and functioning properly.

It is completely fine, however, to split the product code and tests into two separate commits. For discoverability, ease of review, etc.

@tannergooding
Copy link
Member Author

I think you could test the FMA computation by applying the TwoProduct and GrowExpansion routines from this paper to compute an exact result, then round it to a single/double value: https://people.eecs.berkeley.edu/~jrs/papers/robust-predicates.pdf

@Jorenkv, thanks for the reference.

There are definitely many ways in which we can "exactly" validate the result. However, this may be more effort than it is worth and I would like to hear from @CarolEidt/@eerhardt before investing significant time in doing that.

The general rule, so far, has been that the HWIntrinsics are contracted to:

  • emit a particular hardware instruction
  • not modify the inputs
  • not modify the outputs

There are of course certain exceptions, such as the "helper" intrinsics (which don't map to any particular instruction) or certain intrinsics (like CompareEqualOrderedScalar) whose instructions return one or more flags, which have to be converted to some form of boolean expression instead.

The current validation logic ensures that we are emitting some form of MultiplyAdd instruction without worrying (too much) about the underlying implementation/behavior of said instruction (we are basically just validating we didn't encode the instruction wrong).

@4creators
Copy link

that isn't consistent with my previous HWIntrinsic work and it isn't consistent with standard practice

OK So as usual you are contradicting yourself just to have different opinion 😄, AFAIR you explicitly asked me to submit PRs as two separate commits: (i) commit changing product code -> JIT, (ii) commit adding tests. Should I dig up your specific request?

@tannergooding
Copy link
Member Author

OK So as usual you are contradicting yourself just to have different opinion 😄, AFAIR you explicitly asked me to submit PRs as two separate commits: (i) commit changing product code -> JIT, (ii) commit adding tests. Should I dig up your specific request?

Yes, two commits (which is exactly the shape of the current PR). I may have misread your original comment as I thought you were asking why this isn't split into two PRs.

@4creators
Copy link

@tannergooding Ahh, OK I see how this issue arised.

@tannergooding
Copy link
Member Author

Ok, test failures are all in the no AVX runs.

FMA failures are because the ISA isn't filtered out when AVX is disabled (which, today, basically controls the VEX support that FMA requires).

The remaining failures look to have been introduced by #16517 (the last non-PR run doesn't yet include my most recent changes to BuildHWIntrinsic: https://ci.dot.net/job/dotnet_coreclr/job/master/job/jitstress/job/x86_checked_windows_nt_jitx86hwintrinsicnoavx/130/).

They all look to be similar in error: Assertion failed '(consume == 0) || (ComputeAvailableSrcCount(tree) == consume)' -- CC. @CarolEidt

@tannergooding
Copy link
Member Author

Rebased onto master to pick up #18116
Fixed up the compiler to no longer enable FMA if the global AVX flag is also disabled.

@tannergooding
Copy link
Member Author

Logged https://github.com/dotnet/coreclr/issues/18119 for the remaining, unrelated, failures.

@tannergooding
Copy link
Member Author

PR resolving the other test failures is here: #18120

@tannergooding
Copy link
Member Author

@dotnet-bot test Windows_NT x64 Checked jitincompletehwintrinsic
@dotnet-bot test Windows_NT x64 Checked jitx86hwintrinsicnoavx
@dotnet-bot test Windows_NT x64 Checked jitx86hwintrinsicnoavx2
@dotnet-bot test Windows_NT x64 Checked jitx86hwintrinsicnosimd
@dotnet-bot test Windows_NT x64 Checked jitnox86hwintrinsic

@dotnet-bot test Windows_NT x86 Checked jitincompletehwintrinsic
@dotnet-bot test Windows_NT x86 Checked jitx86hwintrinsicnoavx
@dotnet-bot test Windows_NT x86 Checked jitx86hwintrinsicnoavx2
@dotnet-bot test Windows_NT x86 Checked jitx86hwintrinsicnosimd
@dotnet-bot test Windows_NT x86 Checked jitnox86hwintrinsic

@dotnet-bot test Ubuntu x64 Checked jitincompletehwintrinsic
@dotnet-bot test Ubuntu x64 Checked jitx86hwintrinsicnoavx
@dotnet-bot test Ubuntu x64 Checked jitx86hwintrinsicnoavx2
@dotnet-bot test Ubuntu x64 Checked jitx86hwintrinsicnosimd
@dotnet-bot test Ubuntu x64 Checked jitnox86hwintrinsic

@tannergooding
Copy link
Member Author

@CarolEidt, all issues should now be resolved.

@Jorenkv
Copy link

Jorenkv commented May 25, 2018

@tannergooding Ah of course. You're not doing software fallback so then there's no point in checking if the computation is precisely correct.

Copy link

@CarolEidt CarolEidt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've looked through all the code, and have just a couple of comment/naming suggestions - as well as a couple of things to consider for future.
I plan to look at the encodings (instrxarch.h) just to double-check against the arch manual.

@@ -2606,6 +2582,34 @@ void Compiler::compSetProcessor()
}
}
}

// There are currently two sets of flags that control AVX, FMA, and AVX2 support
// This is the general EnableAVX flag and the individual ISA flags. We need to

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: I would end the first line with ':' and start the second line with "These are ..."

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

@@ -5360,12 +5455,85 @@ void emitter::emitIns_SIMD_R_R_I(instruction ins, emitAttr attr, regNumber reg,
}
}

void emitter::emitIns_SIMD_R_R_R_A(
instruction ins, emitAttr attr, regNumber reg, regNumber reg1, regNumber reg2, GenTreeIndir* indir)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this (and subsequent), I would name the first reg argument targetReg or dstReg or something. Unlike some of the other methods, this one is designed only to support the case where the first argument is the dest, so it would be good to be descrptive.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can I submit a separate PR fixing up all the register names here? Currently the majority are regNumber reg where it should be regNumber targetReg and it would be nice to fix them all up (I could also do it in this PR, if you think that is fine).

@@ -17422,6 +17422,16 @@ bool GenTree::isRMWHWIntrinsic(Compiler* comp)
switch (AsHWIntrinsic()->gtHWIntrinsicId)
{
case NI_SSE42_Crc32:
case NI_FMA_MultiplyAdd:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems to me that perhaps we should just use a flag for this - especially if there are more intrinsics that will have this behavior. As I think I've mentioned before I don't like negative booleans - so I would prefer changing the existing flag to something like HW_Flag_SSE_RMWSemantics and then adding HW_Flag_RMWSemantics. But that is a topic for another discussion.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems to me that perhaps we should just use a flag for this - especially if there are more intrinsics that will have this behavior.

I don't think we have that many of them (there are very few RMW intrinsics for the VEX encoding). I'll add a comment indicating we should revisit if this grows any more.

As I think I've mentioned before I don't like negative booleans - so I would prefer changing the existing flag to something like HW_Flag_SSE_RMWSemantics and then adding HW_Flag_RMWSemantics. But that is a topic for another discussion.

Right, I think we have an issue tracking this. -- I think the primary problem right now is that the majority case is positive, and it is easier to track/add the minority case (whether positive or negative).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a TODO-XArch-Cleanup comment here.

varNum = tmpDsc->tdTempNum();
offset = 0;

compiler->tmpRlsTemp(tmpDsc);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another possible comment for future: if there isn't a method to do this (I couldn't find it off-hand), perhaps there should be (returning a varNum):

tmpDsc = getSpillTempDsc(op3);
varNum = tmpDsc->tdTempNum();
compiler->tmpRlsTemp(tmpDsc);

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a TODO-XArch-Cleanup comment

@CarolEidt
Copy link

@fiigii - would you have time to also have a look at this?

Copy link

@CarolEidt CarolEidt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM
Thanks! And sorry it took me a while to get to this.

@fiigii
Copy link

fiigii commented May 25, 2018

@CarolEidt Sorry, I have no time to look at the PR in detail until 6/10. Please feel free to move forward if it looks good to you.

@tannergooding tannergooding merged commit 8db778b into dotnet:master May 25, 2018
@CarolEidt
Copy link

@fiigii - no problem, but thanks for the quick response.

@tannergooding tannergooding deleted the hwintrin-fma branch May 30, 2018 04:15
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Implement the HWIntrinsics for FMA
6 participants